CDP workflow - 1st Extract

A possible plan for standardizing the dataset:

  • Access raw data
  • Review raw data
  • Find missing data
  • Adjust the sample
  • De-identify data
  • Drop irrelevant columns
  • Split columns
  • Rename variables
  • Normalize variables
  • Standardize variables
  • Update variable types
  • Recode variables
  • Construct new variables
  • Add missing values
  • Add metadata
  • Validate data
  • Join data
  • Reshape data
  • Save clean data
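The middle steps of this plan can be sketched as a dplyr/tidyr pipeline. This is a hypothetical skeleton only: `raw`, `city_country`, `organisation`, and the output path are illustrative assumptions, not the actual CDP schema.

```r
library(dplyr)
library(tidyr)

clean <- raw %>%                                  # `raw` is an assumed input tibble
  select(-starts_with("internal_")) %>%           # drop irrelevant columns
  separate(city_country,                          # split merged columns
           into = c("city", "country"), sep = ", ") %>%
  rename(account_name = organisation) %>%         # rename variables
  mutate(year = as.integer(year),                 # update variable types
         response = na_if(response, "N/A")) %>%   # recode explicit missings
  distinct()                                      # drop duplicate rows

saveRDS(clean, "data/clean/cdp_clean.rds")        # save clean data
```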

Criteria the data should meet through this process:

  1. Complete
  2. Valid
  3. Accurate
  4. Consistent
  5. De-identified
  6. Interpretable
  7. Analyzable

Let’s perform validation checks on that table:

Checks

Pointblank Validation
QA checks - 2. Is the data Valid?

Table: waste raw dataset (tibble). Thresholds: WARN 0.00; STOP and NOTIFY not set.

| STEP | CHECK | COLUMN | VALUES | UNITS | PASS | FAIL |
|------|-------|--------|--------|-------|------|------|
| 1 | `col_vals_not_null()` | `account_name` | — | 488 | 488 (1.00) | 0 (0.00) |
| 2 | `col_vals_not_null()` | `question_number` | — | 488 | 488 (1.00) | 0 (0.00) |
| 3 | `col_vals_not_null()` | `row_number` | — | 488 | 488 (1.00) | 0 (0.00) |
| 4 | `rows_distinct()` | `account_name` | — | 77 | 77 (1.00) | 0 (0.00) |
| 5 | `rows_distinct()` | `account_name` | — | 52 | 52 (1.00) | 0 (0.00) |
| 6 | `rows_distinct()` | `account_name` | — | 64 | 64 (1.00) | 0 (0.00) |
| 7 | `rows_distinct()` | `account_name` | — | 62 | 62 (1.00) | 0 (0.00) |
| 8 | `rows_distinct()` | `account_name` | — | 30 | 30 (1.00) | 0 (0.00) |
| 9 | `rows_distinct()` | `account_name` | — | 47 | 47 (1.00) | 0 (0.00) |
| 10 | `rows_distinct()` | `account_name` | — | 54 | 54 (1.00) | 0 (0.00) |
| 11 | `rows_distinct()` | `account_name` | — | 55 | 55 (1.00) | 0 (0.00) |
| 12 | `rows_distinct()` | `account_name` | — | 47 | 47 (1.00) | 0 (0.00) |
| 13 | `col_is_numeric()` | `question_number` | — | 1 | 1 (1.00) | 0 (0.00) |
| 14 | `col_is_numeric()` | `row_number` | — | 1 | 1 (1.00) | 0 (0.00) |
| 15 | `col_is_numeric()` | `response_orig` | — | 1 | 0 (0.00) | 1 (1.00) |
| 16 | `col_is_numeric()` | `response_suggested` | — | 1 | 0 (0.00) | 1 (1.00) |
| 17 | `col_is_integer()` | `year_orig` | — | 1 | 0 (0.00) | 1 (1.00) |
| 18 | `col_is_integer()` | `year_suggested` | — | 1 | 0 (0.00) | 1 (1.00) |
| 19 | `col_vals_between()` | `year_orig` | [2000, 2024] | 488 | 488 (1.00) | 0 (0.00) |
| 20 | `col_vals_between()` | `year_suggested` | [2000, 2024] | 488 | 487 (0.99) | 1 (0.01) |
| 21 | `col_vals_regex()` | `response_orig` | `^[0-9]` | 488 | 488 (1.00) | 0 (0.00) |
| 22 | `col_vals_regex()` | `response_suggested` | `^[0-9]` | 488 | 487 (0.99) | 1 (0.01) |
| 23 | `col_vals_regex()` | `year_orig` | `^[0-9]` | 488 | 488 (1.00) | 0 (0.00) |
| 24 | `col_vals_regex()` | `year_suggested` | `^[0-9]` | 488 | 487 (0.99) | 1 (0.01) |
| 25 | `col_vals_equal()` | `response_wrong` | 0 | 488 | 468 (0.96) | 20 (0.04) |
| 26 | `col_vals_equal()` | `response_prob_likely_wrong` | 0 | 488 | 481 (0.99) | 7 (0.01) |

What each check verifies:

  • `col_vals_not_null()` — there are no NULL/NA values in the column
  • `rows_distinct()` — rows are unique per indicator
  • `col_is_numeric()` — the column is numeric
  • `col_is_integer()` — the column is integer
  • `col_vals_between()` — the year column is between 2000 and the current year, 2024
  • `col_vals_regex()` — values contain numbers only
  • `col_vals_equal()` — flags comments where the word "wrong" or "probably/likely wrong" is mentioned (expected value 0)

Interrogation started 2024-06-27 16:01:32 -03 and completed in < 1 s (2024-06-27 16:01:33 -03).
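A report like the one above comes from a pointblank agent along these lines. This is a sketch: `waste_raw` is an assumed object name, and the nine per-indicator `rows_distinct()` steps are collapsed into one.

```r
library(pointblank)

agent <- create_agent(tbl = waste_raw, tbl_name = "waste raw dataset") %>%
  col_vals_not_null(columns = c(account_name, question_number, row_number)) %>%
  rows_distinct(columns = account_name) %>%   # in the report: once per indicator subset
  col_is_numeric(columns = c(question_number, row_number,
                             response_orig, response_suggested)) %>%
  col_is_integer(columns = c(year_orig, year_suggested)) %>%
  col_vals_between(columns = c(year_orig, year_suggested),
                   left = 2000, right = 2024) %>%
  col_vals_regex(columns = c(response_orig, response_suggested,
                             year_orig, year_suggested),
                 regex = "^[0-9]") %>%
  col_vals_equal(columns = c(response_wrong, response_prob_likely_wrong),
                 value = 0) %>%
  interrogate()

agent  # printing the interrogated agent renders the validation report
```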

QA checks - 1. Is the data Complete?

Check the C40 city list against the list of cities CDP asked

Number of cities (responses):
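Assuming two data frames `c40_cities` and `cdp_cities`, each with a `city` column (both names and the toy rows below are assumptions), the comparison can be done with anti-joins:

```r
library(dplyr)

# Hypothetical stand-ins for the two lists
c40_cities <- tibble(city = c("Quito", "Lima", "Bogota"))
cdp_cities <- tibble(city = c("Quito", "Lima", "Medellin"))

anti_join(c40_cities, cdp_cities, by = "city")  # C40 cities CDP did not ask
anti_join(cdp_cities, c40_cities, by = "city")  # CDP cities not on the C40 list
nrow(cdp_cities)                                # number of cities (responses)
```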



Check columns that seem to be merged (they appear as lists in the dataframe)
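List-columns can be located with `where(is.list)` and expanded with `tidyr::unnest()`. Here `df` is a toy stand-in for the raw tibble:

```r
library(dplyr)
library(tidyr)

# Toy example: `responses` is a list-column, as merged columns appear
df <- tibble(city = c("Quito", "Lima"),
             responses = list(c(1, 2), c(3, 4, 5)))

list_cols <- df %>% select(where(is.list)) %>% names()
list_cols                                  # which columns look merged

df %>% unnest(cols = all_of(list_cols))    # one value per row
```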



Check values as numbers: clean them and verify that no value got lost
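`readr::parse_number()` does this kind of numeric cleanup, and comparing missingness before and after shows whether any value got lost. The `response_orig` values below are toy data:

```r
library(dplyr)
library(readr)

df <- tibble(response_orig = c("1,200", "45 t", "n/a", NA))

df_num <- df %>% mutate(response_num = parse_number(response_orig))

# Values lost in conversion: non-missing before, NA after
sum(!is.na(df$response_orig) & is.na(df_num$response_num))
```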

  • The number of columns in your data should match the number of variables in your data dictionary (i.e., no variables were accidentally dropped). Similarly, there should be no unexpected missing data for variables (i.e., if the data was collected, it should exist in your dataset).
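Both halves of the bullet above can be checked mechanically when the data dictionary is kept as a data frame. The `dictionary` object and its `variable` column, and the toy `df`, are assumptions:

```r
# Toy data and dictionary with one row per variable
df <- data.frame(city = c("Quito", "Lima"), year = c(2023, 2024))
dictionary <- data.frame(variable = c("city", "year"))

stopifnot(setequal(names(df), dictionary$variable))  # no variables dropped
colSums(is.na(df))                                   # unexpected missings per column
```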